WHY Causal ML?
Let's look at a generated example:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly

plotly.offline.init_notebook_mode()

np.random.seed(42)
n_samples = 100
health_index = np.random.rand(n_samples)  # Random health index values between 0 and 1

# Patients with health_index < 0.4 are always treated
treatment_admission = np.where(health_index < 0.4, True, False)
# Patients in the 0.4-0.6 band are treated at random (50/50)
treatment_admission = np.where(
    (health_index >= 0.4) & (health_index <= 0.6),
    np.random.choice([True, False], n_samples, p=[0.5, 0.5]),
    treatment_admission
)
# Treated patients (the sicker ones) end up with lower survival than untreated ones
survival_percent = np.where(
    treatment_admission,
    np.random.uniform(0.4, 0.8, n_samples),
    np.random.uniform(0.8, 1, n_samples)
)

df = pd.DataFrame({
    'health_index': health_index,
    'treatment_admission': treatment_admission,
    'survival_percent': survival_percent
})

fig = px.scatter(
    df,
    x="health_index",
    y="survival_percent",
    color="treatment_admission",
)
fig.show()
In a situation where a health_index determines both a patient's chance of survival and their chance of receiving treatment, we seldom (or potentially never) gather data for health_index < threshold AND not treated, or for health_index > threshold AND treated, both of which would be necessary for A/B testing. This is where we would particularly like to approximate the unsampled regions using Causal ML.
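We can check this lack of overlap directly. Regenerating the same synthetic treatment assignment with the same seed, no sick patient goes untreated and no healthy patient gets treated; these are exactly the regions a direct A/B comparison would need (the "positivity" or overlap problem):

```python
import numpy as np

np.random.seed(42)
n_samples = 100
health_index = np.random.rand(n_samples)
treated = health_index < 0.4
mid_band = (health_index >= 0.4) & (health_index <= 0.6)
treated = np.where(
    mid_band,
    np.random.choice([True, False], n_samples, p=[0.5, 0.5]),
    treated
)

# Regions required for a direct A/B comparison but never observed:
untreated_sick = int((~treated & (health_index < 0.4)).sum())
treated_healthy = int((treated & (health_index > 0.6)).sum())
print(untreated_sick, treated_healthy)  # 0 0
```

Both counts are zero by construction, which is why a naive comparison of treated vs. untreated survival here is hopelessly confounded by health_index.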
We will be exploring Causal ML using the Infant Birth Data of 2022 from the CDC, focusing on the effectiveness of Admission_NICU on Infant_Living.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import dowhy
from dowhy import CausalModel
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("Birth_US_2022.csv")
df.columns
There are many columns; let's keep only a subset to make the data more readable.
try:
    df = df.loc[:, [
        "Infant_Living",
        "Admission_NICU",
        "Birth_Weight (g)",
        "Limb_Reduction_Defect",
        "Cleft_Lip",
        "Down_Syndrome",
        "Suspected_Chromosomal_Disorder",
        "Hypospadias",
        "APGAR_5min",
        "Gastroschisis",
        "Omphalocele",
        "Cyanotic_Congenital_Heart_Disease",
        "Congenital_Diaphragmatic_Hernia",
        "Meningomyelocele",
        "Anencephaly"
    ]]
except KeyError:
    # Running the cell twice raises KeyError because the other columns are gone
    print("Already Dropped!")
print(df.columns)
# Downsample to a balanced set: 8,000 surviving and 8,000 non-surviving infants
df_true_sample = df[df['Infant_Living'] == "Y"].sample(n=8000, replace=False)
df_false_sample = df[df['Infant_Living'] == "N"].sample(n=8000, replace=False)
df = pd.concat([df_true_sample, df_false_sample])
df.head()
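Note that downsampling to equal class sizes changes the baseline survival rate in the sample, which is worth keeping in mind when reading absolute effect sizes later. A toy sketch of the same balancing pattern (the column name is reused for illustration; the data here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"Infant_Living": rng.choice(["Y", "N"], size=1000, p=[0.9, 0.1])})

# Downsample the majority class to the size of the minority class
n = toy["Infant_Living"].value_counts().min()
balanced = pd.concat([
    toy[toy["Infant_Living"] == "Y"].sample(n=n, random_state=0),
    toy[toy["Infant_Living"] == "N"].sample(n=n, random_state=0),
])
print(balanced["Infant_Living"].value_counts().to_dict())  # equal 'Y' and 'N' counts
```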
orig_len = len(df)
BOOL_COLS = [
    "Infant_Living",
    "Admission_NICU",
    "Limb_Reduction_Defect",
    "Cleft_Lip",
    "Down_Syndrome",
    "Suspected_Chromosomal_Disorder",
    "Hypospadias",
    "Gastroschisis",
    "Omphalocele",
    "Cyanotic_Congenital_Heart_Disease",
    "Congenital_Diaphragmatic_Hernia",
    "Meningomyelocele",
    "Anencephaly"
]
# Map Y/N flags to booleans; 'U' and 'P' become missing, 'C' counts as True
mapping = {'N': False, 'Y': True, 'U': pd.NA, "C": True, "P": pd.NA}
df[BOOL_COLS] = df[BOOL_COLS].replace(mapping)
df.dropna(inplace=True)
for col in BOOL_COLS:
    try:
        df[col] = df[col].astype('boolean')  # pandas nullable boolean dtype
    except (TypeError, ValueError):
        print(col)
print("Non-Null Ratio: ", len(df) / orig_len)
df['Birth_Weight (kg)'] = df['Birth_Weight (g)'] / 1000  # vectorized; no .apply needed
df.drop(columns=['Birth_Weight (g)'], inplace=True)
df
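The mapping above leans on pandas' nullable `boolean` dtype which, unlike NumPy's plain `bool`, can represent missing values (`pd.NA`). A minimal self-contained illustration:

```python
import pandas as pd

s = pd.Series(["Y", "N", "U", "C", "P"])
mapping = {'N': False, 'Y': True, 'U': pd.NA, "C": True, "P": pd.NA}
b = s.replace(mapping).astype("boolean")  # nullable BooleanDtype

print(b.isna().sum())  # 2 -- 'U' and 'P' became missing
print(b.sum())         # 2 -- 'Y' and 'C' are True; NA is skipped
```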
From the result above, we can infer the following dependency graph (with many assumptions about conditions in which I am not an expert):
causal_graph = """
digraph {
    Infant_Living;
    Admission_NICU;
    APGAR_5min;
    Gastroschisis;
    Omphalocele;
    Cyanotic_Congenital_Heart_Disease;
    Congenital_Diaphragmatic_Hernia;
    Weight[label="Birth_Weight (kg)"];
    Admission_NICU -> Infant_Living;
    APGAR_5min -> Admission_NICU;
    Weight -> Infant_Living;
    Gastroschisis -> Admission_NICU;
    Gastroschisis -> Infant_Living;
    Cyanotic_Congenital_Heart_Disease -> Admission_NICU;
    Cyanotic_Congenital_Heart_Disease -> Infant_Living;
    Congenital_Diaphragmatic_Hernia -> Admission_NICU;
    Congenital_Diaphragmatic_Hernia -> Infant_Living;
    Omphalocele -> Admission_NICU;
    Omphalocele -> Infant_Living;
}"""
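As a sanity check, the backdoor adjustment set can be read off this graph by hand: in this DAG it is simply the set of common parents of the treatment and the outcome. A pure-Python sketch (the parent sets are transcribed from the graph above):

```python
# Parent sets transcribed from the DOT graph above
parents = {
    "Admission_NICU": {
        "APGAR_5min", "Gastroschisis", "Omphalocele",
        "Cyanotic_Congenital_Heart_Disease", "Congenital_Diaphragmatic_Hernia",
    },
    "Infant_Living": {
        "Admission_NICU", "Birth_Weight (kg)", "Gastroschisis", "Omphalocele",
        "Cyanotic_Congenital_Heart_Disease", "Congenital_Diaphragmatic_Hernia",
    },
}

# Common causes of treatment and outcome -- the confounders to adjust for
confounders = sorted(parents["Admission_NICU"] & parents["Infant_Living"])
print(confounders)
```

Note that APGAR_5min influences only the treatment and not the outcome directly, so it is not part of the adjustment set.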
model = dowhy.CausalModel(
    data=df.reset_index(drop=True),
    graph=causal_graph.replace("\n", " "),
    treatment="Admission_NICU",
    outcome="Infant_Living"
)
model.view_model(size=(12, 8))
With the graph in place, we identify the estimand; DoWhy prints the assumptions under which the effect is identifiable:
identified_estimand = model.identify_effect(
    method_name="exhaustive-search",
    proceed_when_unidentifiable=True,
)
print(identified_estimand)
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_weighting",
    method_params={"glm_family": sm.families.Binomial()}
)
print(estimate)
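To make the estimator less of a black box: propensity score weighting reweights each row by the inverse probability of the treatment it actually received, which removes the confounding bias that a naive group comparison suffers from. A self-contained sketch on synthetic data (all names and effect sizes here are invented for illustration), comparing the inverse-propensity-weighted estimate with the confounded naive difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.random(n) < 0.5                        # binary confounder
t = rng.random(n) < np.where(x, 0.8, 0.2)      # treatment is more likely when x is True
y = rng.random(n) < (0.9 - 0.2 * t - 0.3 * x)  # true treatment effect: -0.2

# Empirical propensity scores P(T=1 | X), one per stratum of x
e = np.where(x, t[x].mean(), t[~x].mean())

# Inverse-propensity-weighted ATE (Horvitz-Thompson form)
ate_ipw = np.mean(y * t / e) - np.mean(y * (1 - t) / (1 - e))
ate_naive = y[t].mean() - y[~t].mean()
print(round(ate_ipw, 2), round(ate_naive, 2))  # IPW recovers ~ -0.2; naive is biased (~ -0.38)
```

The naive difference is more negative than the truth because the confounder both raises the chance of treatment and lowers the outcome, the same pattern suspected in the NICU data.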
# Interpreting Results
interpretation = estimate.interpret(method_name="textual_effect_interpreter")
interpretation
res_placebo = model.refute_estimate(
    identified_estimand, estimate,
    method_name="placebo_treatment_refuter",
    show_progress_bar=True,
    placebo_type="permute"
)
print(res_placebo)
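What the placebo refuter does, in essence: it replaces the real treatment with a randomly permuted one and re-runs the estimation; if the original estimate captures a real signal, the placebo estimate should collapse to roughly zero. A minimal sketch of the idea on synthetic data (names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
t = rng.random(n) < 0.5
y = 0.7 - 0.2 * t + rng.normal(0, 0.1, n)  # real treatment effect: -0.2

real = y[t].mean() - y[~t].mean()

# Placebo: permute the treatment labels, breaking any causal link to y
t_placebo = rng.permutation(t)
placebo = y[t_placebo].mean() - y[~t_placebo].mean()
print(round(real, 2), round(placebo, 2))  # real ~ -0.2, placebo ~ 0.0
```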
Conclusion: As stated at the beginning, one needs strong domain knowledge to utilize Causal ML effectively. This study's hypothesis turns out to have been wrong in the first place: the fact that Admission_NICU correlates negatively with Infant_Living likely means that NICU admission is not a treatment for infant survival at all; rather, the sicker infants are the ones being admitted.
import plotly.express as px

# Mean survival rate per APGAR score, split by NICU admission
plot_df = df.groupby(["Admission_NICU", "APGAR_5min"])['Infant_Living'].mean().reset_index()
fig = px.line(
    plot_df,
    x="APGAR_5min",
    y="Infant_Living",
    color="Admission_NICU",
)
fig.show()
From the chart above, we can see that this looks very different from the generated example at the beginning. This indicates that the decision to admit a newborn to the NICU is NOT dependent on APGAR alone. Other variables, crucial in determining a newborn's chance of living, are not modelled in this study; this would require further research. Regardless, this was still a valuable learning experience.